EDA
Individual EDA of Work Variations
Factor w/ 4 levels "[0,5.3]","(5.3,7.9]",..: 2 4 2 3 1 3 3 3 3 2 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 5.300 7.900 9.251 11.600 100.000 101


Factor w/ 4 levels "[0,23.7]","(23.7,31.7]",..: 3 1 2 2 4 2 1 4 2 2 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 23.70 31.70 33.23 41.80 100.00 105


Factor w/ 4 levels "[0,20.3]","(20.3,23.9]",..: 2 2 2 3 1 4 3 4 3 1 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 20.30 23.90 24.12 27.70 100.00 105


Factor w/ 4 levels "[0,14.1]","(14.1,18.3]",..: 2 4 4 3 2 2 4 1 2 1 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 14.10 18.30 19.65 24.00 100.00 105


Factor w/ 4 levels "[0,5.4]","(5.4,8.7]",..: 3 3 3 2 1 2 3 2 2 3 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 5.400 8.700 9.636 12.800 100.000 105


Factor w/ 4 levels "[0,7.7]","(7.7,12.3]",..: 3 4 3 3 3 3 3 1 3 4 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 7.70 12.30 13.36 17.80 100.00 105


Factor w/ 4 levels "[0,3.5]","(3.5,5.4]",..: 2 3 4 1 2 3 1 3 2 3 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.500 5.400 6.109 7.900 100.000 105


Next, the seven work-variation variables (professional, production, unemployment, office, service, construction, and self-employed) were assessed for normality and for their relationship with income. In the boxplots, income per capita decreased across the quartiles of unemployment, service, construction, and production. That is to say, as the share of a census tract accounted for by one of these categories rose, the income per capita fell. The only work variation that exhibited an increase in average income was professional work. The remaining variables, office and self-employed, stayed relatively stable across quartiles. In the histograms, only the proportion of professionals appeared approximately normal; the remaining six work variations were all skewed to the right. For professionals, the Q-Q plot supported normality, as the points did not stray far from the reference line and the right and left tails were very small. The same cannot be said for the other variables, each of which had an oversized right tail and a relatively small left tail. Overall, the proportion of professionals appeared normally distributed while the other work variations did not.
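The quartile comparison described above can be sketched in a few lines of R. This is a minimal illustration on simulated data, not the report's actual code: the `unemployment` and `income` vectors are hypothetical stand-ins, and it is an assumption that the four-level factors printed earlier were built with `quantile()` and `cut()`.

```r
set.seed(1)
n <- 1000

# Hypothetical stand-in data: a right-skewed percentage and an income column
unemployment <- pmin(rgamma(n, shape = 2, scale = 4.5), 100)
income <- 40000 - 900 * unemployment + rnorm(n, sd = 4000)

# Quartile bins, analogous to the 4-level factors printed above
breaks  <- quantile(unemployment, probs = seq(0, 1, 0.25))
unemp_q <- cut(unemployment, breaks = breaks, include.lowest = TRUE)

# Mean income per quartile (the quantity the boxplots compare visually)
income_by_q <- tapply(income, unemp_q, mean)

# Moment-based skewness; positive values indicate a longer right tail
skewness   <- function(x) mean((x - mean(x))^3) / sd(x)^3
unemp_skew <- skewness(unemployment)

print(round(income_by_q))
print(unemp_skew)
```

A positive skewness value is a numeric counterpart to the right-skewed histograms, and the per-quartile means give the trend that the boxplots show graphically.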
Individual EDA of ethnicities
Factor w/ 4 levels "[0,0.8]","(0.8,4]",..: 3 4 4 2 4 3 4 3 3 3 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.80 4.00 13.78 15.32 100.00


Factor w/ 4 levels "[0,2.4]","(2.4,7.2]",..: 1 1 1 3 1 3 2 1 1 1 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 2.40 7.20 17.36 21.50 100.00


Factor w/ 4 levels "[0,0.1]","(0.1,1.2]",..: 2 3 3 1 3 1 1 1 1 2 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.100 1.200 4.347 4.400 91.300


Factor w/ 4 levels "[0,37.1]","(37.1,70.3]",..: 3 2 3 3 2 3 3 3 4 3 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 37.10 70.30 61.24 88.40 100.00


Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.7567 0.4000 100.0000


Finally, the five ethnicity variables (Native, White, Black, Hispanic, and Asian) were investigated. The boxplots for White showed an increase in average income across the first, second, and third quartiles but no change in the fourth. The boxplots for Asian showed an increase from the first through the fourth quartile. For Hispanic, average income increased slightly between the first and second quartiles, did not change in the third quartile, and decreased noticeably in the fourth. For Black, average income increased between the first and second quartiles and then decreased from the second to the fourth quartile. Overall, it appeared that average income did change with the concentration of ethnicities in a census tract. The histogram for White was bimodal, with the highest frequency at over 8,000. The histograms for the other four ethnicities were skewed to the right. Based on the histograms, White had the highest responses, followed by Hispanic, Black, Asian, and Native. In the Q-Q plots, the points for each ethnicity variable followed a curve around the reference line, with large left and right tails. There were not enough responses for Native to construct a meaningful boxplot, and the Native Q-Q plot showed a clear pattern of points around the line, implying non-normality. Therefore, based on the assessment of the boxplots, histograms, and Q-Q plots, none of the ethnicity variables appear normally distributed.
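The last summary above has a minimum, first quartile, and median that are all 0. With a variable like that, quartile binning breaks down, which may be one reason a quartile boxplot was not meaningful. The sketch below uses a simulated zero-inflated variable (`native` is a hypothetical stand-in) and assumes the bins are built with `quantile()` and `cut()`.

```r
set.seed(2)

# Hypothetical zero-inflated percentage: most tracts report 0
native <- c(rep(0, 700), round(runif(300, 0, 20), 1))

breaks  <- quantile(native, probs = seq(0, 1, 0.25))
has_dup <- anyDuplicated(breaks) > 0   # TRUE: several quartile breaks equal 0

# cut() refuses duplicated breaks, so capture the error message instead
res <- tryCatch(
  cut(native, breaks = breaks, include.lowest = TRUE),
  error = function(e) conditionMessage(e)
)

# One possible fallback: collapse duplicated breaks, giving fewer bins
native_bins <- cut(native, breaks = unique(breaks), include.lowest = TRUE)

print(breaks)
print(table(native_bins))
```

Collapsing the breaks keeps the code running but yields fewer than four groups, so the resulting comparison is coarser than for the other variables.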
'data.frame': 69672 obs. of 11 variables:
$ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
$ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
$ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
$ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
$ Professional: num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
$ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
$ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
$ Construction: num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
$ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
$ Unemployment: num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
$ IncomePerCap: int 25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
[1] 626
[1] 0
[1] 11
'data.frame': 69567 obs. of 11 variables:
$ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
$ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
$ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
$ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
$ Professional: num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
$ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
$ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
$ Construction: num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
$ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
$ Unemployment: num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
$ IncomePerCap: int 25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
- attr(*, "na.action")= 'omit' Named int 1484 1807 2299 2499 2789 4259 4444 4448 4449 4477 ...
..- attr(*, "names")= chr "1514" "1851" "2370" "2574" ...
PCA

A Principal Component Analysis (PCA) and Principal Component Regression (PCR) seemed suited to this dataset. The purpose of this technique is to decrease the number of variables while accounting for collinearity. Within this dataset there are 12 variables to explain IncomePerCap. However, the correlation matrix shows notable correlation between some of the predictor variables. For example, Professional is notably correlated with Service, Construction, Production, and Unemployment, and White is notably correlated with Hispanic and Black. Based on this initial overview of the correlation matrix, PCA seemed suitable and the analysis proceeded.
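The correlation screening described above could look roughly like the following. The data are simulated and the column names are hypothetical stand-ins for the census variables; the 0.5 cutoff is an arbitrary choice for illustration, not a value taken from the analysis.

```r
set.seed(3)
n <- 500

# Hypothetical predictors; service and production are built to move
# against professional, loosely mimicking the correlations described above
professional <- rnorm(n, 33, 12)
service      <- 35 - 0.5 * professional + rnorm(n, sd = 4)
production   <- 30 - 0.4 * professional + rnorm(n, sd = 4)
office       <- rnorm(n, 24, 5)
X <- data.frame(professional, service, production, office)

R <- cor(X)

# List each predictor pair once if its absolute correlation exceeds the cutoff
cutoff <- 0.5
idx <- which(abs(R) > cutoff & upper.tri(R), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(R)[idx[, "row"]],
  var2 = colnames(R)[idx[, "col"]],
  r    = R[idx]
)

print(flagged)
```

Pairs that show up in `flagged` are the ones whose shared variation PCA would fold into common components.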


Roughly 70,000 data points (69,567 census tracts after rows with missing values were removed) were analyzed. The biplot on the left shows the variation along the axes of PC1 and PC2. PC1 carries the most variation, with scores between approximately -6 and 8, as confirmed by the summary data below, while PC2 scores range from about -10 to 7. The variables Professional, Black, Production, and White are split fairly evenly between PC1 and PC2, whereas other variables such as Office, Service, Unemployment, and Construction are represented mainly in PC2 rather than PC1.
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.7878 1.3389 1.1653 1.0355 0.88819 0.82267 0.76933
Proportion of Variance 0.3196 0.1792 0.1358 0.1072 0.07889 0.06768 0.05919
Cumulative Proportion  0.3196 0.4989 0.6347 0.7419 0.82078 0.88845 0.94764
                           PC8     PC9     PC10
Standard deviation     0.71304 0.12303 0.003342
Proportion of Variance 0.05084 0.00151 0.000000
Cumulative Proportion  0.99849 1.00000 1.000000
The breakdown of the variation explained by each component shows that just under 50% of the variation is accounted for by the first two components and about 63% by the first three. However, except for the first component, the amount of variation explained changes by a similar amount from one component to the next. This is further illustrated by the accompanying graph.
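The proportions in the importance table can be recomputed from the reported standard deviations, since each component's variance is the square of its standard deviation. The numbers below are copied from that table; the variable names are only illustrative.

```r
# Standard deviations copied from the "Importance of components" output
sdev <- c(1.7878, 1.3389, 1.1653, 1.0355, 0.88819,
          0.82267, 0.76933, 0.71304, 0.12303, 0.003342)

eigenvalues <- sdev^2                          # variance carried by each component
prop_var    <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var     <- cumsum(prop_var)                # cumulative proportion

print(round(sum(eigenvalues), 3))
print(round(cum_var[1:3], 4))
```

The variances sum to about 10, which is what one would expect from ten standardized predictors. The near-zero variance of PC10 suggests that the predictors are almost linearly dependent; in the rows printed by `str()`, the five occupation shares (Professional, Service, Office, Construction, Production) add up to roughly 100, which would be consistent with that.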

Call:
lm(formula = IncomePerCap ~ ., data = pcadata_pcr_rot)
Residuals:
Min 1Q Median 3Q Max
-57889 -3154 -136 3093 39355
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 20.93 1250.463 < 2e-16 ***
PC1 -4585.05 11.71 -391.701 < 2e-16 ***
PC2 -1454.29 15.63 -93.043 < 2e-16 ***
PC3 604.54 17.96 33.664 < 2e-16 ***
PC4 994.55 20.21 49.214 < 2e-16 ***
PC5 -878.20 23.56 -37.274 < 2e-16 ***
PC6 1377.18 25.44 54.140 < 2e-16 ***
PC7 -205.74 27.20 -7.564 3.96e-14 ***
PC8 -196.99 29.35 -6.712 1.93e-11 ***
PC9 3301.06 170.10 19.407 < 2e-16 ***
PC10 -3519.28 6262.21 -0.562 0.574
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5519 on 69556 degrees of freedom
Multiple R-squared: 0.7102, Adjusted R-squared: 0.7101
F-statistic: 1.704e+04 on 10 and 69556 DF, p-value: < 2.2e-16
A full principal component regression was then run on these components, and all except the last component (PC10) are significant. This regression explains only 71.0% of the variability in IncomePerCap (multiple R-squared of 0.7102). The strongest predictor is PC1, whose t-value is by far the largest in magnitude.

R-squared shows the share of variation in the dependent variable, IncomePerCap, that is explained by the components. The steep initial increase followed by a leveling off in the R-squared graph suggests that a substantial amount of the variation in IncomePerCap is explained by the first component alone.
Based on the initial analysis of the R-squared graph and the results of the regression, it seemed appropriate to run a regression on just PC1, which resulted in a lower adjusted R-squared (0.6393 versus 0.7101).
Call:
lm(formula = IncomePerCap ~ PC1, data = pcadata_pcr_rot)
Residuals:
Min 1Q Median 3Q Max
-42965 -4026 -442 3546 36606
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 23.34 1121.0 <2e-16 ***
PC1 -4585.05 13.06 -351.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6157 on 69565 degrees of freedom
Multiple R-squared: 0.6393, Adjusted R-squared: 0.6393
F-statistic: 1.233e+05 on 1 and 69565 DF, p-value: < 2.2e-16
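The two regressions above can be mimicked on simulated data to show how a principal component regression behaves. This is a sketch under stated assumptions: `x1`, `x2`, `x3`, and `y` are hypothetical stand-ins, and only base R functions (`prcomp()`, `lm()`, `reformulate()`) are used.

```r
set.seed(4)
n <- 400

# Hypothetical correlated predictors and a response
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)
x3 <- rnorm(n)
y  <- 26000 - 4000 * x1 - 1500 * x2 + 800 * x3 + rnorm(n, sd = 3000)

# PCA on standardized predictors, then regress y on the component scores
pca    <- prcomp(cbind(x1, x2, x3), center = TRUE, scale. = TRUE)
scores <- data.frame(y = y, pca$x)        # columns: y, PC1, PC2, PC3

fit_full <- lm(y ~ ., data = scores)      # all components
fit_pc1  <- lm(y ~ PC1, data = scores)    # first component only

# R-squared as components are added one at a time
r2_by_k <- sapply(1:3, function(k) {
  f <- reformulate(paste0("PC", 1:k), response = "y")
  summary(lm(f, data = scores))$r.squared
})

print(coef(fit_full)["PC1"])
print(coef(fit_pc1)["PC1"])
print(round(r2_by_k, 4))
```

Because the component scores are centered and mutually uncorrelated, dropping components does not change the remaining estimates. That matches the outputs above, where the intercept (26167.82) and the PC1 coefficient (-4585.05) are identical in both models and only the standard errors and R-squared differ.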
The tradeoff between parsimony and explanatory power makes the choice between these two models unclear.
If we choose the more explanatory model, selecting the number of components by adjusted R-squared, we eliminate only one component from the regression, so the model is not effectively pared down. However, bias remains low because only one component was dropped.
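For reference, the adjusted R-squared values in the two outputs can be approximately reproduced from the reported R-squared, the sample size, and the number of predictors. The helper `adj_r2` is a hypothetical name; the inputs are the rounded figures printed above, so the results agree only up to rounding.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 69567                                 # observations after na.omit()

adj_full <- adj_r2(0.7102, n, p = 10)      # all ten components
adj_pc1  <- adj_r2(0.6393, n, p = 1)       # PC1 only

print(round(c(full = adj_full, pc1 = adj_pc1), 4))
```

With this many observations the penalty for extra components is negligible, so adjusted R-squared barely differs from R-squared and offers little help in arguing for the smaller model.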
K-Means
List of 9
$ cluster : Named int [1:69567] 2 2 2 2 2 1 2 1 2 2 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:2, 1:11] -0.329 0.174 0.42 -0.222 -0.32 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "1" "2"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:2] 1.15e+12 1.34e+12
$ tot.withinss: num 2.5e+12
$ betweenss : num 4.81e+12
$ size : int [1:2] 24086 45481
$ iter : int 1
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 2 clusters of sizes 24086, 45481
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3285506 0.4200874 -0.3199683 0.2372900 0.9282858 -0.6150829
2 0.1739950 -0.2224715 0.1694500 -0.1256649 -0.4916051 0.3257379
Office Construction Production Unemployment IncomePerCap
1 -0.01890396 -0.4302842 -0.6556146 -0.5268802 37598.33
2 0.01001123 0.2278715 0.3472029 0.2790272 20114.40
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 1 2 2 1 2 2 2
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 2 1 1 1 1 1 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 1.153649e+12 1.344211e+12
(between_SS / total_SS = 65.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
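The between_SS / total_SS figure in the k = 2 output can be checked from the printed components. The values below are copied from that output (the total is printed to only three significant digits), and the comparison value assumes, hypothetically, that all 11 columns had been standardized with `scale()`.

```r
# Values copied from the k = 2 output above
withinss <- c(1.153649e12, 1.344211e12)
totss    <- 7.31e12                        # printed to three significant digits

betweenss <- totss - sum(withinss)
ratio     <- betweenss / totss             # share of total SS between clusters

# For comparison: the total sum of squares that 11 fully standardized
# columns would give, since each scaled column contributes n - 1
n <- 69567
p <- 11
totss_if_standardized <- (n - 1) * p

print(round(ratio, 3))
print(totss_if_standardized)
```

The printed total of 7.31e+12 is far larger than the standardized comparison value, which is consistent with IncomePerCap entering the clustering in dollars, as its cluster means suggest. If so, income would account for most of the distance between tracts.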

List of 9
$ cluster : Named int [1:69567] 2 1 1 2 2 2 1 2 2 2 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:3, 1:11] 0.432 -0.24 -0.363 -0.575 0.34 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "1" "2" "3"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:3] 4.33e+11 3.94e+11 3.93e+11
$ tot.withinss: num 1.22e+12
$ betweenss : num 6.09e+12
$ size : int [1:3] 27175 29638 12754
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 3 clusters of sizes 27175, 29638, 12754
Cluster means:
Hispanic White Black Asian Professional Service
1 0.4318377 -0.5751979 0.3952287 -0.15378102 -0.7689991 0.6127621
2 -0.2397814 0.3399978 -0.2076592 -0.02410908 0.1329131 -0.2111302
3 -0.3629095 0.4354829 -0.3595529 0.38368702 1.3296436 -0.8149861
Office Construction Production Unemployment IncomePerCap
1 -0.02605785 0.2974405 0.51204618 0.6194822 16576.79
2 0.07327428 0.0108065 -0.07894429 -0.3101802 27821.96
3 -0.11475466 -0.6588701 -0.90756657 -0.5991303 42759.54
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 1 1 2 2 2 1 2 2 2 2 1 1 1 2 2 2 1 3 3 2 2 2 1 2 2
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
1 1 2 2 3 2 3 1 3 3 2 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 433025278355 393691127581 392888818743
(between_SS / total_SS = 83.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 4 2 4 4 4 1 4 1 4 4 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:4, 1:11] -0.302 0.668 -0.374 -0.132 0.407 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "1" "2" "3" "4"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:4] 1.76e+11 1.92e+11 1.73e+11 1.75e+11
$ tot.withinss: num 7.15e+11
$ betweenss : num 6.6e+12
$ size : int [1:4] 17574 17698 8266 26029
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 4 clusters of sizes 17574, 17698, 8266, 26029
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3016974 0.4066104 -0.28173021 0.1074823 0.5711741 -0.43512866
2 0.6680011 -0.8809881 0.57590840 -0.1651807 -0.9293827 0.83254425
3 -0.3738848 0.4389987 -0.37976189 0.4559152 1.5331982 -0.92442949
4 -0.1317654 0.1850702 -0.08076331 -0.1050413 -0.2406168 0.02128076
Office Construction Production Unemployment IncomePerCap
1 0.06629532 -0.2290380 -0.4319137 -0.46098899 32840.09
2 -0.05484069 0.3137655 0.5749681 0.89883620 14433.73
3 -0.17952039 -0.7693828 -1.0184110 -0.62970642 45780.77
4 0.04953752 0.1856318 0.2240905 -0.09992813 23412.84
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 2 4 4 4 1 4 1 4 4 4 4 4 2 4 1 4 2 1 1 4 4 1 2 4 4
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 4 1 1 3 1 1 4 1 3 4 1 1 4 4 4 4 4 2 2 2 2 2 4 4 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 4 4 2 4 4 4 2 2 2 4 4 4 2 2 4 4 4 2 4 2 4 4
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 175536566241 191571930523 172553341760 175374034150
(between_SS / total_SS = 90.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 5 1 1 1 5 5 1 4 1 5 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:5, 1:11] -0.00846 0.85002 -0.3769 -0.33227 -0.24884 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:5] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:5] 9.11e+10 1.06e+11 9.67e+10 9.02e+10 8.75e+10
$ tot.withinss: num 4.72e+11
$ betweenss : num 6.84e+12
$ size : int [1:5] 20760 12813 6192 11579 18223
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 5 clusters of sizes 20760, 12813, 6192, 11579, 18223
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.008456987 -0.01428733 0.06737527 -0.12483068 -0.4552929 0.2049365
2 0.850017949 -1.08804866 0.68083922 -0.17734851 -1.0217088 0.9792310
3 -0.376900236 0.44391016 -0.39300833 0.48161313 1.6342530 -0.9824220
4 -0.332267736 0.42488521 -0.31698657 0.22115974 0.8637778 -0.5739916
5 -0.248840396 0.36049690 -0.22051300 -0.03726641 0.1329121 -0.2234518
Office Construction Production Unemployment IncomePerCap
1 0.02032925 0.25592838 0.38069044 0.09861782 20788.54
2 -0.07442558 0.32126962 0.59350367 1.09913874 13091.47
3 -0.21879800 -0.82200303 -1.06575585 -0.64529647 47520.79
4 0.02701923 -0.40030391 -0.64316728 -0.52575884 36236.88
5 0.08634809 0.01621363 -0.08018998 -0.33184070 27836.80
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
5 1 1 1 5 5 1 4 1 5 1 1 1 1 5 5 1 2 4 4 5 5 5 1 1 1
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
1 1 5 5 4 4 4 1 4 3 1 5 5 1 1 1 5 1 2 1 2 1 2 1 1 1
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
1 1 1 2 1 1 1 1 1 1 1 5 1 2 2 1 1 5 1 1 2 1 1
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 91107426890 106355408998 96685021931 90198216197 87531045107
(between_SS / total_SS = 93.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 6 4 4 6 6 2 4 2 6 6 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:6, 1:11] -0.349 -0.285 0.974 0.132 -0.385 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:6] 5.33e+10 5.39e+10 6.79e+10 5.20e+10 5.61e+10 ...
$ tot.withinss: num 3.36e+11
$ betweenss : num 6.98e+12
$ size : int [1:6] 8294 13096 9901 16290 4763 17223
$ iter : int 3
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 6 clusters of sizes 8294, 13096, 9901, 16290, 4763, 17223
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3494717 0.4296992 -0.3360808 0.30654695 1.0897348 -0.68272359
2 -0.2854601 0.3976771 -0.2665621 0.05425283 0.4263884 -0.36762711
3 0.9739417 -1.2113147 0.7268560 -0.18905224 -1.0768116 1.07158297
4 0.1316581 -0.2288132 0.2195608 -0.13438783 -0.6075451 0.36540166
5 -0.3852568 0.4492850 -0.4005145 0.50197893 1.7117093 -1.02643109
6 -0.1925231 0.2792048 -0.1502203 -0.09190832 -0.1287054 -0.06945889
Office Construction Production Unemployment IncomePerCap
1 -0.029446162 -0.5329557 -0.7839551 -0.5661097 38948.40
2 0.089116820 -0.1457015 -0.3276215 -0.4288697 31179.25
3 -0.088194644 0.3291525 0.5983303 1.2390235 12157.33
4 0.007625546 0.2812103 0.4727567 0.2799036 18933.00
5 -0.254726418 -0.8568239 -1.1023498 -0.6538796 48913.98
6 0.060350087 0.1491981 0.1403863 -0.1974674 24809.23
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
6 4 4 6 6 2 4 2 6 6 6 4 4 4 6 2 6 3 1 1 6 6 2 4 6 6
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
4 4 2 2 1 2 1 4 1 5 6 2 2 6 4 4 6 4 3 4 3 4 3 4 4 4
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
4 4 6 4 4 4 4 4 4 4 4 6 4 4 4 4 4 6 4 4 3 4 6
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 53289943534 53860735274 67861554081 51981016101 56119303347 52399097534
(between_SS / total_SS = 95.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 2 5 7 7 2 2 7 3 7 7 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:7, 1:11] -0.396 -0.258 -0.313 1.054 0.237 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:7] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:7] 3.01e+10 3.39e+10 3.39e+10 5.06e+10 3.68e+10 ...
$ tot.withinss: num 2.52e+11
$ betweenss : num 7.06e+12
$ size : int [1:7] 3586 13111 9445 8258 13680 6091 15396
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 7 clusters of sizes 3586, 13111, 9445, 8258, 13680, 6091, 15396
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3964506 0.4542494 -0.40764714 0.5282089 1.7925081 -1.0717282
2 -0.2576300 0.3733856 -0.23385201 -0.0269068 0.1755536 -0.2474022
3 -0.3125257 0.4200419 -0.30528640 0.1527982 0.6907586 -0.4893448
4 1.0539699 -1.2761352 0.73515083 -0.1940444 -1.1025060 1.1203953
5 0.2366030 -0.3958078 0.34172930 -0.1373108 -0.7010158 0.4859066
6 -0.3575666 0.4294729 -0.34914245 0.3685170 1.2778561 -0.7833195
Office Construction Production Unemployment IncomePerCap
1 -0.29189615 -0.895033912 -1.1399021 -0.65954880 50244.25
2 0.08656937 -0.003686199 -0.1156669 -0.35053069 28266.51
3 0.06459196 -0.301790762 -0.5299635 -0.49757889 34274.90
4 -0.08602009 0.329565942 0.5903558 1.33577584 11570.50
5 -0.02053170 0.292153979 0.5252348 0.41200162 17787.86
6 -0.07512151 -0.640358239 -0.8939002 -0.59387071 41486.24
[ reached getOption("max.print") -- omitted 1 row ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 5 7 7 2 2 7 3 7 7 7 7 5 5 7 2 7 5 3 3 2 2 2 5 7 7
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
5 5 2 2 6 3 3 7 3 6 7 2 3 7 5 5 2 5 4 5 5 5 4 5 7 5
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
5 5 7 5 5 7 7 5 5 5 5 2 5 5 5 7 5 2 5 7 4 5 7
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 30078367255 33918666154 33924690315 50620627295 36763050021 31979820579
[7] 34290620831
(between_SS / total_SS = 96.6 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 1 8 7 1 1 5 7 5 1 1 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:8, 1:11] -0.21 -0.398 -0.331 -0.359 -0.276 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:8] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:8] 2.38e+10 2.24e+10 2.47e+10 2.28e+10 2.44e+10 ...
$ tot.withinss: num 1.95e+11
$ betweenss : num 7.12e+12
$ size : int [1:8] 13055 3152 7742 5140 10623 6131 13192 10532
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 8 clusters of sizes 13055, 3152, 7742, 5140, 10623, 6131, 13192, 10532
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.2099021 0.31065143 -0.17687237 -0.08368332 -0.08389187 -0.09881344
2 -0.3981141 0.45578358 -0.40960492 0.53311507 1.81785638 -1.08784127
3 -0.3314624 0.43062774 -0.31765082 0.20067007 0.83132143 -0.55740904
4 -0.3588452 0.42873072 -0.36110010 0.40713285 1.35698554 -0.82340123
5 -0.2764009 0.38868612 -0.25353852 0.02632578 0.34875186 -0.33062038
6 1.1604954 -1.34363273 0.72389507 -0.20529827 -1.11835632 1.17173776
Office Construction Production Unemployment IncomePerCap
1 0.06138421 0.1289094 0.1064936 -0.23047133 25340.47
2 -0.30249732 -0.9092239 -1.1486967 -0.66205029 50788.01
3 0.03930789 -0.3810941 -0.6273035 -0.52118182 35892.73
4 -0.10289042 -0.6828886 -0.9379571 -0.61013558 42677.39
5 0.09089521 -0.1005031 -0.2648395 -0.40948506 30223.86
6 -0.08558404 0.3344531 0.5598170 1.46755772 10698.00
[ reached getOption("max.print") -- omitted 2 rows ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
1 8 7 1 1 5 7 5 1 1 1 7 7 8 1 5 7 8 3 3 1 1 5 7 7 7
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
8 7 5 5 4 5 3 7 3 4 1 5 5 1 7 7 1 7 6 8 8 8 8 7 7 7
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
8 7 7 8 7 7 7 8 7 8 7 1 7 8 8 7 7 1 7 7 6 7 7
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 23806194051 22351110068 24674184543 22763698227 24425126210 32229516603
[7] 22734152115 22364091140
(between_SS / total_SS = 97.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 7 3 3 7 9 1 3 1 7 7 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:9, 1:11] -0.294 -0.347 0.049 -0.402 1.22 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:9] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:9] 1.54e+10 1.52e+10 1.73e+10 1.58e+10 2.42e+10 ...
$ tot.withinss: num 1.55e+11
$ betweenss : num 7.16e+12
$ size : int [1:9] 8025 5905 11753 2719 4994 4354 12230 9140 10447
$ iter : int 3
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 9 clusters of sizes 8025, 5905, 11753, 2719, 4994, 4354, 12230, 9140, 10447
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.29398811 0.4084707 -0.2881384 0.09801206 0.5412472 -0.41887898
2 -0.34732990 0.4326622 -0.3266366 0.26353963 0.9956851 -0.63591035
3 0.04903299 -0.1030093 0.1312846 -0.13421926 -0.5427295 0.27794329
4 -0.40241766 0.4583190 -0.4144673 0.54771279 1.8491328 -1.10706894
5 1.21957721 -1.3726251 0.7042742 -0.21976591 -1.1201758 1.19133726
6 -0.35966388 0.4293714 -0.3692219 0.42709539 1.4300252 -0.86183629
Office Construction Production Unemployment IncomePerCap
1 0.092171984 -0.214882278 -0.42684953 -0.4601316 32552.93
2 -0.003486057 -0.478374937 -0.72844821 -0.5525047 37694.40
3 0.017290823 0.273752980 0.44804925 0.1820710 19710.62
4 -0.310870838 -0.926207144 -1.16439186 -0.6642217 51363.70
5 -0.083922955 0.335218762 0.54026333 1.5507180 10154.85
6 -0.136970585 -0.719826670 -0.97229327 -0.6191718 43867.61
[ reached getOption("max.print") -- omitted 3 rows ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
7 3 3 7 9 1 3 1 7 7 7 3 3 3 7 9 7 8 2 2 9 9 9 3 7 7
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
3 3 1 9 2 1 2 3 2 6 7 1 1 7 3 3 9 3 5 8 8 8 8 3 3 3
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
3 3 7 8 3 3 3 8 3 3 3 9 3 8 8 3 3 7 3 3 5 3 7
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 15416927509 15194151787 17298838728 15761413559 24236871148 16554689443
[7] 17160968891 17348442911 16348725566
(between_SS / total_SS = 97.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 9 2 8 9 5 5 8 10 9 9 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:10, 1:11] 1.293 0.172 -0.361 0.733 -0.269 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:10] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:10] 1.67e+10 1.25e+10 1.39e+10 1.17e+10 1.19e+10 ...
$ tot.withinss: num 1.28e+11
$ betweenss : num 7.18e+12
$ size : int [1:10] 3740 9868 3958 7455 8799 2488 5279 10782 10229 6969
$ iter : int 2
$ ifault : int 4
- attr(*, "class")= chr "kmeans"
K-means clustering with 10 clusters of sizes 3740, 9868, 3958, 7455, 8799, 2488, 5279, 10782, 10229, 6969
Cluster means:
Hispanic White Black Asian Professional Service
1 1.29347073 -1.3926502 0.65110535 -0.227144565 -1.10439711 1.2206491
2 0.17193979 -0.2995527 0.27181346 -0.131665132 -0.65475827 0.4157717
3 -0.36067127 0.4329611 -0.37594612 0.433931771 1.47068572 -0.8852975
4 0.73251794 -1.0447516 0.74170741 -0.163014767 -1.02683185 0.9384102
5 -0.26944247 0.3834500 -0.24218980 -0.002215569 0.27358175 -0.2989392
6 -0.40143948 0.4574908 -0.41624907 0.552865396 1.86667764 -1.1156001
Office Construction Production Unemployment IncomePerCap
1 -0.076264666 0.32954885 0.47907560 1.65962676 9446.23
2 -0.005854058 0.28803780 0.50879858 0.34460804 18266.13
3 -0.150092261 -0.74228280 -0.99238368 -0.63110451 44529.44
4 -0.088022816 0.32539421 0.65367907 0.92950639 14162.91
5 0.094618993 -0.05725246 -0.20068842 -0.38834330 29357.81
6 -0.323599903 -0.93488726 -1.16996086 -0.66198939 51688.15
[ reached getOption("max.print") -- omitted 4 rows ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
9 2 8 9 5 5 8 10 9 9 8 8 2 2 9 5 8 4 10 7 9 9 5 2 8 8
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 2 5 5 7 10 10 8 10 3 8 5 10 8 2 2 9 2 4 2 4 2 4 2 8 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 2 8 4 2 8 8 2 2 2 8 9 2 4 2 8 2 9 2 8 4 2 8
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 16668900913 12479477177 13910959117 11660948482 11895068060 12674139695
[7] 12952720882 11959674961 11678714620 12133241415
(between_SS / total_SS = 98.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

K-means is an unsupervised learning algorithm whose goal is to find groups, or clusters, of observations in order to identify patterns. The values in the data set were standardized so that the variables could be compared on a similar scale, although the cluster means above report IncomePerCap in its original dollar units. K-means was run for k = 2 through 10 clusters.

For k = 2, the cluster with the higher IncomePerCap (37,598) also had the higher cluster means for Professional (0.928), White (0.420), and Asian (0.237). The cluster plot shows all of the roughly 70,000 data points in green and the two clusters in blue and red, respectively. The clusters appear to overlap, but this occurs because the plot projects all of the data points onto a two-dimensional graph. With only two clusters, the between-cluster sum of squares is about 65.8% of the total sum of squares.

A model with k = 3 was examined next. The cluster with the highest IncomePerCap was cluster three, at 42,760. This cluster also had the highest cluster means for Professional (1.330) and Asian (0.384). The first cluster, which had an IncomePerCap cluster mean of 16,577, had the highest Unemployment cluster mean at 0.619. The cluster plot portrays three clusters, and the overlap makes it a little difficult to see which cluster is which. With three clusters, 83.3% of the total sum of squares is captured, which is a drastic improvement over two clusters.

A final analysis was carried out for a model with k = 4. The cluster with the highest IncomePerCap was cluster three, at 45,781. This cluster had the highest Professional cluster mean at 1.533 and the highest Asian cluster mean at 0.456. The cluster with the lowest IncomePerCap was cluster two, at 14,434; it had the highest Unemployment cluster mean at 0.899. The cluster plot is difficult to interpret because all of the data points were projected onto two dimensions and there are now four clusters. With four clusters, 90.2% of the total sum of squares is captured, a further improvement over two clusters (65.8%) and three clusters (83.3%). As the number of clusters increased from 5 to 10, the percentage captured did not increase drastically; for example, at k = 10, 98.2% is captured. A model with four clusters would therefore be sufficient, as it captures most of the variation in the data.
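The comparison across values of k can be sketched as a loop over `kmeans()` fits. The data here are simulated with four well-separated groups, so the object names and the `nstart = 10` setting are assumptions for illustration rather than the settings used in the analysis above.

```r
set.seed(5)

# Hypothetical data: four groups in two variables
centers <- matrix(c(0, 0,  6, 0,  0, 6,  6, 6), ncol = 2, byrow = TRUE)
grp <- sample(1:4, 600, replace = TRUE)
X   <- centers[grp, ] + matrix(rnorm(1200), ncol = 2)
Xs  <- scale(X)   # put every column on a common scale before clustering

# Share of the total sum of squares captured between clusters, for each k
ks <- 2:10
explained <- sapply(ks, function(k) {
  fit <- kmeans(Xs, centers = k, nstart = 10)
  fit$betweenss / fit$totss
})
names(explained) <- ks

# Gain from adding one more cluster; small gains suggest diminishing returns
gain <- diff(explained)

print(round(explained, 3))
print(round(gain, 3))
```

The value of k after which the gains become small is the usual "elbow" choice, which is the same reasoning used above to settle on four clusters. Using several random starts (`nstart`) reduces the chance that a single poor initialization distorts the comparison.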